A cross-language focused crawling algorithm based on multiple relevance prediction strategies
Authors
Abstract
Similar Resources
A Review of Focused Web Crawling Strategies
The modern world, with its intense competition, also brings a responsibility to preserve users' valuable time when they search for information on the web. But the volume of indexed data is enormous, and with differing user perspectives, search over a standard exhaustive crawl is significantly affected. A standard crawler starts well with a promising set of initial seed URLs...
Evolving Strategies for Focused Web Crawling
The rapid growth of the World Wide Web has created many challenges for general-purpose crawling, search engines, and web directories, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl...
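To make the frontier-ordering idea concrete, here is a minimal best-first sketch in Python. It is not the evolved strategy from the paper above: score_relevance, fetch, and extract_links are hypothetical stand-ins for a real relevance predictor, page downloader, and link extractor.

import heapq

def score_relevance(url, anchor_text, topic_terms):
    # Hypothetical predictor: fraction of topic terms seen in the link context.
    context = (url + " " + anchor_text).lower()
    hits = sum(1 for term in topic_terms if term in context)
    return hits / max(len(topic_terms), 1)

def crawl(seed_urls, topic_terms, fetch, extract_links, max_pages=100):
    # Max-heap via negated scores: the most promising URL is expanded first.
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)  # caller-supplied download function
        collected.append((url, -neg_score))
        for link, anchor in extract_links(page):
            if link not in visited:
                score = score_relevance(link, anchor, topic_terms)
                heapq.heappush(frontier, (-score, link))
    return collected

The priority queue is what makes the crawl "focused" rather than exhaustive: only the best-scored link is fetched next, which is the core of most best-first focused crawlers regardless of how the score itself is learned or evolved.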
Language Specific and Topic Focused Web Crawling
We describe an experiment on automatically collecting large language- and topic-specific corpora using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the crawler...
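The pipeline this abstract outlines (sample corpus, query phrases, seed URLs, classified crawl) might be roughed out as follows. The document-frequency term extraction, the search_engine callable, and the term-overlap test standing in for the text classifier are all simplifying assumptions, not the system's actual components.

from collections import Counter
import re

def extract_query_terms(sample_docs, k=10):
    # Pick the k terms that occur in the most sample documents.
    df = Counter()
    for doc in sample_docs:
        df.update(set(re.findall(r"[a-z]{4,}", doc.lower())))
    return [term for term, _ in df.most_common(k)]

def acquire_seed_urls(terms, search_engine, queries=5):
    # Pair consecutive top terms into two-word queries; collect returned URLs.
    seeds = []
    for i in range(min(queries, len(terms) - 1)):
        seeds.extend(search_engine(terms[i] + " " + terms[i + 1]))
    return list(dict.fromkeys(seeds))  # deduplicate while keeping order

def is_on_topic(page_text, terms, threshold=0.3):
    # Toy stand-in for the text classifier: enough query terms must appear.
    text = page_text.lower()
    return sum(term in text for term in terms) / max(len(terms), 1) >= threshold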
Focused Crawling System based on Improved LSI
In this research work we developed a semi-deterministic algorithm and a scoring system that takes advantage of Latent Semantic Indexing (LSI) for crawling web pages that belong to a particular domain or are specific to a topic. The proposed algorithm calculates a preference factor in addition to the LSI score to determine which web page should be preferred for crawling by the mu...
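One plausible reading of combining an LSI score with a preference factor, sketched with NumPy. The truncated-SVD embedding and query fold-in are standard LSI; the URL-based preference factor and the alpha weighting are invented here for illustration and are not the paper's formulas.

import numpy as np

def lsi_space(term_doc_counts, rank=2):
    # Truncated SVD of a terms-by-documents count matrix (standard LSI).
    u, s, vt = np.linalg.svd(term_doc_counts, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]  # vt column j = doc j's latent coords

def lsi_score(query_vec, u, s, doc_latent):
    # Cosine similarity between the folded-in query and a document's coords.
    q = (query_vec @ u) / s  # fold-in: S^-1 U^T q
    denom = np.linalg.norm(q) * np.linalg.norm(doc_latent)
    return float(q @ doc_latent) / denom if denom else 0.0

def crawl_priority(query_vec, u, s, doc_latent, url, topic_terms, alpha=0.3):
    # Invented preference factor: a bonus when topic terms appear in the URL.
    pref = sum(t in url.lower() for t in topic_terms) / max(len(topic_terms), 1)
    return lsi_score(query_vec, u, s, doc_latent) + alpha * pref

Adding the preference factor on top of the similarity score lets the crawler break ties between pages that look equally relevant in the latent space, which matches the abstract's description of preferring certain pages beyond the raw LSI score.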
Focused crawling for both relevance and quality of medical information
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we ...
Journal
Journal title: Computers & Mathematics with Applications
Year: 2009
ISSN: 0898-1221
DOI: 10.1016/j.camwa.2008.09.021